Abstract. In this paper we present a new problem, the fast set intersection
problem, which is to preprocess a collection of sets in order to
efficiently report the intersection of any two sets in the collection. In
addition we suggest new solutions for the two-dimensional substring indexing
problem and the document listing problem for two patterns by
reduction to the fast set intersection problem.



1 Introduction and Related Work 

The intersection of large sets is a common problem in the context of retrieval 
algorithms, search engines, evaluation of relational queries and more. Relational 
databases use indices to decrease query time, but when a query involves two 
different indices, each one returning a different set of results, we have to intersect 
these two sets to get the final answer. The running time of this task depends 
on the size of each set, which can be large and make the query evaluation take 
longer even if the number of results is small. In information retrieval the
inverted index is widely used as a major indexing structure for mapping a word
to the set of documents that contain that word. Given a word, it is easy to get
from the inverted index the set of all the documents that contain that word.
Nevertheless, if we would like to search for two words to get all documents that
contain both, the inverted index does not help us that much. We have to calculate
the occurrence set for each word and intersect these two sets. The problem of
intersecting sets is also motivated by web search engines, where the dataset
is very large.

Various algorithms for intersecting sets have been introduced
in the literature. Demaine et al. [1] proposed a method for computing
the intersection of k sorted sets using an adaptive algorithm. Baeza-Yates [2]
proposed an algorithm for the multiple searching problem, which is directly
related to computing the intersection of two sets. Barbay et al. [3] showed that
using interpolation search improves the performance of adaptive intersection
algorithms. They introduced an intersection algorithm for two sorted sequences
that is fast on average. In addition Bille et al. [4] presented a solution for
computing expressions on given sets involving unions and intersections. A special
case of their result is the intersection of m sets containing N elements in total,
which they solve in expected time $O(N (\log \omega)^2 / \omega + m \cdot \mathit{output})$ for word size $\omega$,
where output is the number of elements in the intersection.

In this paper we present a new problem, the fast set intersection problem.
This problem is to preprocess a database of size N consisting of a collection of
m sets to answer queries in which we are given two set indices $i, j \le m$, and wish
to find their intersection. This problem arises wherever two sets must be
intersected, in fields such as Information Retrieval,
Web Searching, Document Indexing and Databases. An optimal solution for this
problem would lead to better solutions for these applications.

We solve this problem using minimal space and still decrease the query time
by using a preprocessing part. Our solution is the first non-trivial algorithm for
this problem. We give a solution that requires linear space with worst case query
time bounded by $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$ where output is the intersection size.

In addition, we present a solution for the two-dimensional substring indexing
problem, introduced by Muthukrishnan et al. [5]. In this problem we preprocess
a database D of size N so that, when given a string pair $(\sigma_1, \sigma_2)$, we can return
all the database string pairs $\alpha_i \in D$ such that $\sigma_1$ is a substring of $\alpha_{i,1}$ and $\sigma_2$
is a substring of $\alpha_{i,2}$. Muthukrishnan et al. suggested a tunable solution for this
problem which uses $O(N^{2-y})$ space for a positive fraction y and query time of
$O(N^{y} + \mathit{output})$ where output is the number of such string pairs. We present
a solution for this problem, based on solving the fast set intersection problem,
that uses $O(N \log N)$ space with $O((\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$ query
time.

In the document listing problem, which was presented by Muthukrishnan [6],
we are given a collection of text documents of total size N which may be preprocessed
so that, when given a pattern p, we can return the set of all the documents that
contain that pattern. Muthukrishnan suggested an optimal solution for this problem
which requires O(N) space with $O(|p| + \mathit{output})$ query time where output is the
number of documents that contain the pattern. However, no optimal solution is
known when the query consists of two patterns p, q and we must return the set of all the
documents that contain them both. The only known solution for this problem is
that of Muthukrishnan [6], which uses $O(N\sqrt{N})$ space and
supports queries in time $O(|p| + |q| + \sqrt{N} + \mathit{output})$. We present a solution for
the document listing problem when the query consists of two patterns. Our
solution uses $O(N \log N)$ space with $O(|p| + |q| + (\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$
query time.

The paper is structured as follows: In Sect. 2 we describe the fast set intersec- 
tion problem. In Sect. 3 we describe our solution for this problem. In Sect. 4 we 
present similar problems with their solutions. In Sect. 5 we present our solution 
for the two-dimensional substring indexing problem and the document listing 
problem for two patterns. In Sect. 6 we present some concluding remarks. 



2 Fast Set Intersection Problem 



We formally define the fast set intersection (FSI) problem. 

Definition 1. Let D be a database of size N consisting of a collection of m
sets. Each set has elements drawn from 1 . . . c. We want to preprocess D so that
given a query of two indices $i, j \le m$, we will be able to calculate the intersection
between sets i, j efficiently.

A naive solution for this problem is to store the sets sorted. Given a query
of two sets, go over the smaller set and check for each element whether it exists in
the second set. This costs $O(\min(|i|, |j|) \log \max(|i|, |j|))$. This solution can be
further improved using hash tables. A static hash table [7] can store n elements
using O(n) space and build time, with O(1) query time. For each set we can
build a hash table to check in O(1) time if an element is in the set or not.
This way the query time is reduced to $O(\min(|i|, |j|))$ using linear space. The
disadvantage of this solution is that in the worst case we go over a lot
of elements even if the intersection is small. A better query time can be gained
by using more space for saving the intersection between every two sets. Using
$O(m^2 c)$ space we get an optimal query time of O(output) where output is the
size of the intersection. Nevertheless, this solution uses far more space.
In the next section we present our solution for the fast set intersection problem
which bounds the query time in the worst case.
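
The hash-table baseline described above can be sketched as follows. This is only an illustration: it assumes Python's built-in set (a dynamic hash table with expected O(1) lookups) as a stand-in for the static hash table of [7], and the name intersect_naive is hypothetical.

```python
def intersect_naive(sets, i, j):
    """Scan the smaller of sets[i], sets[j] and probe the larger one."""
    # Python sets are hash tables, so each membership test is expected O(1).
    small, large = sorted((sets[i], sets[j]), key=len)
    return {x for x in small if x in large}


sets = [{1, 2, 3, 4, 5, 6}, {5, 6}, {2, 5, 9}]
print(intersect_naive(sets, 0, 2))  # the two common elements 2 and 5
```

The running time is proportional to the size of the smaller set, regardless of how small the result is, which is exactly the weakness the next section addresses.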

3 Fast Set Intersection Solution 

Here we present our algorithm for solving the FSI problem. We call the output
of the algorithm, i.e., the intersection of the two sets, the result set. By output we
denote the size of the result set.

3.1 Preprocessing 

For each set in D we store a hash table to know in O(1) time if an element is in
that set or not. In addition, we store the inverse structure, i.e., for each element
we store a hash table to know in O(1) time if it belongs to a given set or not.

Our main data structure consists of an unbalanced binary tree. Starting from
the root node at level 0, each node in that tree handles a number of subsets of the
original sets from D. The cost of a node in that tree is the sum of the sizes of
all the subsets it handles. The root node handles all the m sets in D, therefore,
it costs N.

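Definition 2. Let d be a node which costs n. A large set in d is a set which has
more than $\sqrt{n}$ elements.

Lemma 1. By definition, a node d which costs n can handle at most $\sqrt{n}$ large
sets, since more than $\sqrt{n}$ sets of more than $\sqrt{n}$ elements each would cost more than n.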



A set intersection matrix is a matrix that stores, for each pair of sets, whether
they have a non-empty intersection. For m sets this matrix costs $O(m^2)$ bits of space with
O(1) query time for answering if set i and set j have a non-empty intersection.

For each node we construct a set intersection matrix for the large sets in that
node. By Lemma 1, a node that costs n has at most $\sqrt{n}$ large sets, so saving the
set intersection matrix only for the large sets costs only another n bits.

Now we describe how we divide sets between the children of a node. Only
large sets in a node are propagated down to its two children; we call them
the propagated group. Let d be a node which costs n and let G be its propagated
group. Then, G costs at most n as well. Let E be the set of all elements in the
sets of G. We partition E into two disjoint sets $E_1, E_2$. A given set $S \in G$
is partitioned between the two children as follows: the left child handles
$S \cap E_1$ and the right child handles $S \cap E_2$. We want each child of d to cost
at most $n/2$. Nevertheless, finding such a partition of E is a hard problem, if even
possible at all. To overcome this difficulty we add elements to $E_1$ until
adding another element would make the left child cost more than $n/2$. The next
element, which we denote by e, is marked in d so that, during query
time, we can check whether it lies in the intersection. We now take $E_2 = E - E_1 - \{e\}$, i.e.,
the remaining elements. This way each child costs at most $n/2$.
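
To see why the right child also stays within its budget, the bound can be spelled out as follows. Here cost(X) is a shorthand introduced only for this illustration: it denotes the total number of occurrences of the elements of X in the sets of G, so that cost(E) is the cost of G, which is at most n.

```latex
\begin{align*}
\mathrm{cost}(E_1) &\le \frac{n}{2},
\qquad
\mathrm{cost}(E_1) + \mathrm{cost}(\{e\}) > \frac{n}{2}, \\
\mathrm{cost}(E_2) &= \mathrm{cost}(E) - \mathrm{cost}(E_1) - \mathrm{cost}(\{e\})
 < n - \frac{n}{2} = \frac{n}{2}.
\end{align*}
```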

A leaf in this binary tree is a node of constant size. Because each
node in the tree costs at most half the space of its parent, the tree has at most log N levels.

Theorem 1. The space needed for this data structure is O(N) space. 

Proof. The hash tables for all the sets cost O(N) space. As well the inverse hash 
tables for all the elements cost O(N) space. 

The binary tree structure space cost is as follows: The root costs O(N) bits
for saving the set intersection matrix. In each level we store only another O(N)
bits because the two children of a node together do not cost more than their parent. Hence, the
total cost of this tree structure is $O(N \log N)$ bits, which is O(N) space in terms
of words. □



3.2 Query Answering 

Given sets i, j (without loss of generality we assume $|i| \le |j|$), we start traversing
the tree from the root node. If i is not a large set in the root we check each
element from it in the hash table of j. As there can be at most $\sqrt{N}$ elements in
i because it is not a large set, this will cost $O(\sqrt{N})$. If both i, j are large sets we
do as follows: We check in the set intersection matrix of the root whether there
is a non-empty intersection between i and j. If there is not, there is nothing to
add to the result set, so we stop traversing down. If there is an intersection, we
check in the hash tables whether the element which is marked in that node belongs
to i and j, and add that element to the intersection if it belongs to both. Next
we go down to the children of the root and continue the traversal recursively.

Elements are added to the result set when we get to a node in which
i is not a large set. In this case, we stop traversing down the tree from that
node. Instead we step over all the elements of i in that node, checking for each
one of them if it belongs to j. We call such a node a stopper node.
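
The preprocessing and query steps above can be put together in a small sketch. It is a simplified illustration rather than the exact structure analysed here: the names Node, build, query and intersect are hypothetical, Python sets stand in for the static hash tables, the set intersection matrix is kept as a set of non-empty pairs and built by brute force, elements are assumed to be sortable integers, and queries are assumed to use two different set indices.

```python
import math


class Node:
    def __init__(self, subsets):
        cost = sum(len(s) for s in subsets.values())
        limit = math.sqrt(cost)
        # Non-large subsets stop here; large ones form the propagated group G.
        self.small = {k: s for k, s in subsets.items() if len(s) <= limit}
        group = {k: s for k, s in subsets.items() if len(s) > limit}
        self.large = set(group)
        ids = sorted(group)
        # Set intersection matrix for the large sets, kept as non-empty pairs.
        self.nonempty = {(a, b) for x, a in enumerate(ids)
                         for b in ids[x + 1:] if group[a] & group[b]}
        self.marked = None
        self.children = []
        if len(group) < 2:
            return  # no query can descend below this node
        weight = {}  # number of large sets containing each element
        for s in group.values():
            for e in s:
                weight[e] = weight.get(e, 0) + 1
        e1, e2, left_cost = set(), set(), 0
        for e in sorted(weight):
            if self.marked is not None:
                e2.add(e)                       # remaining elements go right
            elif left_cost + weight[e] > cost / 2:
                self.marked = e                 # first element that overflows
            else:
                e1.add(e)
                left_cost += weight[e]
        for part in (e1, e2):
            sub = {k: s & part for k, s in group.items() if s & part}
            if sub:
                self.children.append(Node(sub))


def build(sets):
    return Node({k: set(s) for k, s in enumerate(sets) if s})


def query(node, i, j, sets, out):
    here = node.small.keys() | node.large
    if i not in here or j not in here:
        return                                   # one set has no elements here
    for a, b in ((i, j), (j, i)):
        if a in node.small:                      # stopper node: scan and probe
            out.update(x for x in node.small[a] if x in sets[b])
            return
    if (min(i, j), max(i, j)) not in node.nonempty:
        return
    e = node.marked
    if e is not None and e in sets[i] and e in sets[j]:
        out.add(e)
    for child in node.children:
        query(child, i, j, sets, out)


def intersect(root, sets, i, j):
    out = set()
    query(root, i, j, sets, out)
    return out


sets = [set(range(0, 40)), set(range(20, 60)), {5, 25, 99}]
root = build(sets)
print(sorted(intersect(root, sets, 0, 2)))  # [5, 25]
```

In this sketch a node becomes a stopper as soon as either of the two queried sets is not large in it, and the non-large one is scanned. Each subset is stored only in the node where it stops being large, which keeps the stored elements close to linear in N.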



Theorem 2. The query time is bounded by $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$.



Proof. The query computation consists of two parts. The tree traversal part and 
the time we spend on stopper nodes. 

There are output elements in the result set, therefore, there can be at most
O(output) stopper nodes. Because the tree height is log N, for each stopper node
we visit at most log N nodes for the tree traversal until we get to it. Therefore,
the tree traversal part adds at most $O(\mathit{output} \log N)$ to the query time. But this
is more than what we actually pay for the tree traversal because some stopper
nodes share part of their path from the root, so the bound can be tightened. Because
the tree is a binary tree, fully traversing it down to depth log output
costs O(output) time. Below this depth we
visit for each stopper node at most $\log N - \log \mathit{output}$ nodes because we
are already at depth log output. Thus, the tree traversal part is bounded by
$O(\mathit{output} + \mathit{output}(\log N - \log \mathit{output}))$. By the rules of logarithms this equals
$O(\mathit{output} + \mathit{output} \log \frac{N}{\mathit{output}})$.

Now, we calculate how much time we spend on all the stopper nodes. A
stopper node is a node in which, during the tree traversal, we have to go over all
elements of a non-large set in that node. The size of a non-large set in a stopper
node at level $l$ is at most $\sqrt{N/2^{l}}$. Suppose there are x stopper nodes. We denote by $l_i$ the level
of stopper node i. For all stopper nodes we pay at most:

$$\sum_{i=1}^{x} \sqrt{\frac{N}{2^{l_i}}} = \sqrt{N} \sum_{i=1}^{x} 1 \cdot \sqrt{2^{-l_i}}$$

The Cauchy-Schwarz inequality is that $\left(\sum_{i=1}^{n} x_i y_i\right)^2 \le \left(\sum_{i=1}^{n} x_i^2\right)\left(\sum_{i=1}^{n} y_i^2\right)$. We
use it in our case, with $x_i = 1$ and $y_i = \sqrt{2^{-l_i}}$, to get:

$$\sqrt{N} \sum_{i=1}^{x} 1 \cdot \sqrt{2^{-l_i}} \le \sqrt{N} \, \sqrt{x} \, \sqrt{\sum_{i=1}^{x} 2^{-l_i}}$$

The Kraft inequality from Information Theory states that for any binary tree:

$$\sum_{l \in \mathit{leaves}} 2^{-\mathit{depth}(l)} \le 1$$

Because we never visit a subtree rooted at a stopper node, in our case
each stopper node can be viewed as a leaf in the binary tree. Therefore, we can
apply the Kraft inequality to all the stopper nodes instead of all tree leaves to
get that $\sum_{i=1}^{x} 2^{-l_i} \le 1$. Using this inequality gives us that:

$$\sum_{i=1}^{x} \sqrt{\frac{N}{2^{l_i}}} \le \sqrt{N} \sqrt{x} = \sqrt{Nx} \le \sqrt{N \cdot \mathit{output}} = \mathit{output} \sqrt{\frac{N}{\mathit{output}}}$$

Thus, we pay $O(\mathit{output} \sqrt{N/\mathit{output}})$ for the time we spend in the stopper nodes.

Therefore, the tree traversal part and the time we spend on all stopper nodes
is $O(\mathit{output} + \mathit{output} \log \frac{N}{\mathit{output}} + \mathit{output} \sqrt{\frac{N}{\mathit{output}}})$. Hence, the final query time is
bounded by $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$. □

Corollary 1. The fast set intersection problem can be solved in linear space
with worst case query time of $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$.



4 Intersection-Empty Query and Intersection-Size Query 

In the FSI problem given a query we want to return the result set, i.e., the 
intersection between two sets. What if we only want to know if there is any in- 
tersection between two sets? We call that the intersection-empty query problem. 
Moreover, sometimes we would like only to know the size of the intersection 
without calculating the actual result set. We define these problems as follows: 

Definition 3. Let D be a database of size N consisting of a collection of m sets.
Each set has elements drawn from 1 . . . c. The intersection-empty query problem
is to preprocess D so that given a query of two indices $i, j \le m$, we can
determine whether sets i, j have any intersection. In the intersection-size query problem,
when given a query we want to calculate the size of the result set.

A naive solution for the intersection-empty query problem is to build a matrix
saving if there is any intersection between every two sets. This solution uses
$O(m^2)$ bits of space with query time of O(1). For the intersection-size query problem
we store the intersection size for every two sets by using slightly more space,
$O(m^2)$ words, with query time of O(1).

We can use part of our FSI solution method to solve the intersection-empty
query problem using O(N) space with $O(\sqrt{N})$ query time. Instead of the whole
tree structure we store only the root node with its set intersection matrix using
O(N) space. Given sets i, j (without loss of generality let us assume $|i| \le |j|$), if
i is not a large set in the root we check each element from it in the hash table of
j. Because i is not a large set, this will cost at most $O(\sqrt{N})$ time. If i is a large
set, then j is large as well, and we check in the set intersection matrix of the root to see if there is
any intersection in O(1) time. Hence, we can solve the intersection-empty query
problem in $O(\sqrt{N})$ time using O(N) space.

With the same method we can solve the intersection-size query problem by
saving the size of the intersection instead of saving if there is any intersection
in the set intersection matrix. This way we can solve the intersection-size query
problem in $O(\sqrt{N})$ time using O(N) space.
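
A root-only structure of this kind might look as follows. This is a sketch under stated assumptions: build_root and intersection_size are hypothetical names, Python sets replace the static hash tables, a dictionary keyed by pairs of large sets replaces the matrix, and queries use two different set indices.

```python
import math


def build_root(sets):
    """Precompute intersection sizes for every pair of large sets."""
    n = sum(len(s) for s in sets)
    large = [k for k, s in enumerate(sets) if len(s) > math.sqrt(n)]
    # At most sqrt(n) sets are large, so at most n pairs are stored.
    size = {(a, b): len(sets[a] & sets[b])
            for a in large for b in large if a < b}
    return set(large), size


def intersection_size(sets, root, i, j):
    large, size = root
    if len(sets[i]) > len(sets[j]):
        i, j = j, i                    # make i the smaller set
    if i not in large:                 # at most sqrt(n) probes
        return sum(1 for x in sets[i] if x in sets[j])
    return size[(min(i, j), max(i, j))]  # both sets are large: O(1) lookup


sets = [set(range(0, 40)), set(range(20, 60)), {5, 25, 99}]
root = build_root(sets)
print(intersection_size(sets, root, 0, 1))  # 20
print(intersection_size(sets, root, 0, 2))  # 2
```

The intersection-empty query is the special case of testing whether the returned size is zero.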



5 Two-Dimensional Substring Indexing Solution 



In this section, we show how to solve the two-dimensional substring indexing
problem and the document listing problem for two patterns using our FSI
solution. The two-dimensional substring indexing problem was introduced by
Muthukrishnan et al. [5]. It is defined as follows:

Definition 4. Let D be a database consisting of a collection of string pairs
$\alpha_i = (\alpha_{i,1}, \alpha_{i,2})$, $1 \le i \le c$, which may be preprocessed. Given a query string pair
$(\sigma_1, \sigma_2)$, the 2-d substring indexing problem is to identify all string pairs $\alpha_i \in D$,
such that $\sigma_1$ is a substring of $\alpha_{i,1}$ and $\sigma_2$ is a substring of $\alpha_{i,2}$.

Muthukrishnan et al. [5] reduced the two-dimensional substring indexing 
problem to the common colors query problem which is defined as follows: 

Definition 5. We are given an array A[1 . . . N] of colors drawn from 1 . . . C.
We want to preprocess this array so that the following query can be answered
efficiently: Given two non-overlapping intervals $I_1, I_2$ in [1, N], list the distinct
colors that occur in both intervals $I_1$ and $I_2$.

The common colors query (CCQ) problem is another intersection problem 
where we have to intersect two intervals on the same array. We now show how 
to solve the CCQ problem by solving the FSI problem. By that we solve the 
two-dimensional substring indexing problem as well. 

Given array A of size N, we build a data structure consisting of log N levels
over this array. In the top level we partition A into two sets of size at most $N/2$, the
first set containing colors, i.e., elements, of A in range $A[1 \ldots N/2]$ and the second
set containing colors in range $A[N/2 + 1 \ldots N]$. Likewise, each level i is partitioned
into $2^i$ sets, each containing a successive run of $N/2^i$ colors from A.
The bottom level, in similar fashion, is therefore partitioned into N sets each
containing a single color from array A. The size of all the sets in each level
is O(N). Therefore, the size needed for all the sets in all levels is $O(N \log N)$.

Lemma 2. An interval I on A can be covered by at most 2 log N sets.

Proof. Assume, by contradiction, that there exists an interval for which at least
m > 2 log N sets are needed. This implies that there is some level in which at least
3 (consecutive) sets are selected. However, for every 2 consecutive sets there
has to be a set in the upper level that contains them both, so we can take it
instead, and cover the same interval with only m − 1 sets, in contradiction to
the assumption that at least m sets are required for the cover. □

Theorem 3. The CCQ problem can be solved using $O(N \log N)$ space with
$O((\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$ query time where output is the number
of distinct colors that occur in both $I_1$ and $I_2$.



Proof. Given two intervals $I_1, I_2$ we want to calculate their intersection. By
Lemma 2, $I_1, I_2$ are each covered by a group of at most 2 log N sets. To get the
intersection of $I_1, I_2$ we take each set from the first group and intersect it with
each set from the second group using our FSI solution. Hence, we have to solve
the FSI problem $O(\log^2 N)$ times. Our FSI solution takes $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$
time and O(N) space for a dataset which costs O(N) space. Here the dataset costs
$O(N \log N)$ space, therefore, we can solve the common colors query problem in
$O((\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$ time using $O(N \log N)$ space. □
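
The reduction can be illustrated with a short sketch. Several simplifications are assumed here and are not part of the original construction: the array length is a power of two, intervals are 0-based and inclusive, the names build_levels, cover and common_colors are hypothetical, and Python's built-in set intersection stands in for each FSI query.

```python
def build_levels(colors):
    """Map (level, index) to the set of colors in that block; level i has 2**i blocks."""
    n = len(colors)  # assumed to be a power of two, n >= 2
    levels, size, level = {}, n // 2, 1
    while size >= 1:
        for index in range(n // size):
            levels[(level, index)] = set(colors[index * size:(index + 1) * size])
        size //= 2
        level += 1
    return levels


def cover(lo, hi, n):
    """Blocks (level, index) whose union is exactly positions lo..hi."""
    blocks = []

    def walk(level, index, start, size):
        end = start + size - 1
        if hi < start or end < lo:
            return                        # block is disjoint from the interval
        if lo <= start and end <= hi:
            blocks.append((level, index)) # block lies fully inside the interval
            return
        half = size // 2
        walk(level + 1, 2 * index, start, half)
        walk(level + 1, 2 * index + 1, start + half, half)

    walk(1, 0, 0, n // 2)
    walk(1, 1, n // 2, n // 2)
    return blocks


def common_colors(colors, levels, i1, i2):
    n = len(colors)
    out = set()
    for a in cover(i1[0], i1[1], n):
        for b in cover(i2[0], i2[1], n):
            out |= levels[a] & levels[b]  # stand-in for one FSI query
    return out


colors = [1, 2, 3, 1, 4, 2, 5, 3]
levels = build_levels(colors)
print(cover(1, 6, 8))  # [(3, 1), (2, 1), (2, 2), (3, 6)]
print(sorted(common_colors(colors, levels, (0, 2), (4, 7))))  # [2, 3]
```

In the example, the interval of positions 1 to 6 is covered by four blocks, within the 2 log N = 6 allowed by Lemma 2 for N = 8.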

As shown in [5], to solve the two-dimensional substring problem we can
solve a CCQ problem. As a result, the two-dimensional substring problem can
be solved in $O((\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$ time using $O(N \log N)$ space.

5.1 Document Listing Solution For Two Patterns 

The document listing problem was presented by Muthukrishnan [6]. In this problem
we are given a collection D of text documents $d_1, \ldots, d_c$, with $\sum_i |d_i| = N$,
which may be preprocessed, so that when given a query consisting of a pattern p
our goal is to return the set of all documents that contain one or more copies
of p. Muthukrishnan presented an optimal solution for this problem by building
a suffix tree for D, searching the suffix tree for p and getting an interval I on
an array with all the occurrences of p in D. The colored range
query problem is then solved on I to get each document only once. This solution requires O(N)
space with optimal query time of $O(|p| + \mathit{output})$ where output is the number of
documents that contain p.

We are interested in solving this problem for a two-pattern query. Given two
patterns p, q, our goal is to return the set of all documents that contain both p
and q. In [6] there is a solution that uses $O(N\sqrt{N})$ space with $O(|p| + |q| + \sqrt{N} + \mathit{output})$
query time. That solution is based on searching a suffix tree of all the
documents for the two patterns p, q in $O(|p| + |q|)$ time. This yields two
intervals: $I_1$ with the occurrences of p and $I_2$ with the occurrences of q. On these intervals
a CCQ problem is solved to get the intersection between $I_1$ and $I_2$, i.e., all the
documents that contain both p and q.

We suggest a new solution based on solving the FSI problem. We use the same
method as Muthukrishnan [6] until we get the two intervals: $I_1$ with the occurrences of p
and $I_2$ with the occurrences of q. Now, we have to solve a CCQ problem, which can be
solved as shown above in Theorem 3. Therefore, the document listing problem for
two patterns can be solved in $O(|p| + |q| + (\sqrt{N \log N \cdot \mathit{output}} + \mathit{output}) \log^2 N)$
time using $O(N \log N)$ space where output is the number of documents that
contain both p and q.

6 Conclusions 

In this paper we developed a method to improve algorithms that intersect
sets as a common task. We solved the fast set intersection problem using O(N)
space with query time bounded by $O(\sqrt{N \cdot \mathit{output}} + \mathit{output})$. We showed how to
improve the solutions to two other problems, the two-dimensional substring indexing problem
and the document listing problem for two patterns, using the fast set intersection
problem.

There is still a lot of research to be done in regards to the fast set intersection 
problem. It is open if the query time can be bounded better. Moreover, we showed 
only two applications for the fast set intersection problem. We are sure that the 
fast set intersection problem can be useful in other fields as well. 